Abstract

Neighborhood graphs are gaining popularity as a concise data representation in machine
learning. However, naive graph construction by pairwise distance calculation takes $O(n^2)$
runtime for $n$ data points, which is prohibitively slow for millions of data points. For
strings of equal length, the multiple sorting method (Uno, 2008) can construct an
$\epsilon$-neighbor graph in $O(n+m)$ time, where $m$ is the number of $\epsilon$-neighbor
pairs in the data. To introduce this remarkably efficient algorithm to continuous domains
such as images, signals and texts, we employ a random projection method to convert vectors
to strings. Theoretical results are presented to elucidate the trade-off between
approximation quality and computation time. Empirical results show the efficiency of our
method in comparison to fast nearest neighbor alternatives.

1. Introduction



Neighborhood graphs are widely used in various machine learning and data mining tasks
such as manifold modeling (Tenenbaum et al., 2000), semi-supervised learning (Zhou et al.,
2004), spectral clustering (Hein et al., 2007), and retrieval of protein sequences (Weston
et al., 2004). There are two types of neighborhood graphs: $k$-nearest neighbor graphs (a
node is connected to its $k$ nearest neighbors) and $\epsilon$-neighbor graphs (two nodes
whose distance is within $\epsilon$ are connected). Naive construction of neighborhood
graphs takes $O(n^2)$ distance calculations, where $n$ denotes the number of data points.
This is prohibitively slow in recent large-scale applications. This paper deals with fast
methods to create $\epsilon$-neighbor graphs. To be precise, let us define our problem as
follows.

Definition 1.1 ($\epsilon$-Neighbor Graph Construction) Given a set of vectors
$x_1, \ldots, x_n \in \Re^M$, enumerate all pairs $i < j$ such that
$\Delta(x_i, x_j) \le \epsilon$, where $\Delta$ is a metric such as the Euclidean distance
or the cosine distance.

The problem of $\epsilon$-neighbor graph construction is related to, but distinctly
different from, $\epsilon$-neighbor search.

Definition 1.2 ($\epsilon$-Neighbor Search) Given a query point $v$ and a set of vectors
$x_1, \ldots, x_n$, enumerate all neighbors $x_i$ satisfying $\Delta(v, x_i) \le \epsilon$.

The problem of $\epsilon$-neighbor graph construction is simpler than $\epsilon$-neighbor
search, because any $\epsilon$-neighbor search algorithm can solve $\epsilon$-neighbor graph
construction by applying the algorithm to every point, but not vice versa. Conventionally,
however, neighborhood graphs have been constructed by nearest neighbor search methods
(Datar et al., 2004; Beygelzimer et al., 2006; Omohundro, 1991). There are two kinds of
methods, exact and approximate. Exact methods, including the cover tree (Beygelzimer et al.,
2006), the ball tree (Omohundro, 1991) and the kd-tree (Lee and Wong, 1977), can find all
neighbors without fail. Approximate methods such as E2LSH (Datar et al., 2004) use random
projections called locality sensitive hashing (LSH) to classify the points into small
cells. Neighbors of a query point are sought only in the cell that includes the query
point. Approximate methods can miss a small fraction of neighbors, but they are
theoretically faster than the exact alternatives. In practical problems, however, such
approximate methods do not compare favorably with exact methods with respect to running
time; see, e.g., Beygelzimer et al. (2006).

Our approach aims to solve the neighbor graph construction problem directly by the
following two steps. First, data points are mapped to strings of discrete symbols using
LSH. Through this mapping, the closeness between two vectors is approximately preserved
as the Hamming distance between strings. Subsequently, all pairs whose Hamming distance
is within $d$ are enumerated by the multiple sorting method (MSM) (Uno, 2008), which is
an exact enumeration method based on masked radix sort. Since our algorithm includes
stochasticity in the random projection part, missing edges can occur in the neighborhood
graph with a certain probability. Nevertheless, our algorithm can explicitly keep the
expected fraction of missing edges (the missing edge ratio) under a small value, e.g., 1.0 x 10~^.

Since our method uses LSH, one might worry that the practical slowness of LSH-based
neighbor search carries over to our method. However, the crucial difference is that the
existing LSH methods try to accomplish the search by random projection alone. Namely, one
has to design the projection such that neighbors are mapped to exactly the same string. In
general, it is hard to create a single mapping that achieves this. In common practice, many replicates
of strings are used to boost the true discovery rate (Datar et al., 2004). Such a scheme 
results in very redundant strings: Typically, more than 100 replicates of relatively short 
strings (length 10-50) are used to reach a reasonable level of accuracy (Andoni and Indyk, 
2005). See Weiss et al. (2009) for related discussions. In our method, on the other hand,
we design the projection such that neighbors are mapped to similar strings, not identical
ones (more precisely, strings whose Hamming distance is at most $d$). As a result, the
number of necessary replicates is typically much smaller than in existing LSH-based
neighbor search.

In empirical evaluation based on 1.6 million images (Torralba et al., 2008), our method 
was significantly faster than the cover tree while keeping the missing edge ratio under 
1.0 X 10~^. It will be shown that good efficiency cannot be attained, if we try to solve the 
problem by LSH alone. 

The rest of this paper is organized as follows. Section 2 reviews the multiple sorting 
method for strings. In Section 3, locality sensitive hashing is introduced and the drawbacks 
of existing LSH-based neighbor search are discussed. In Section 4, we present our approach
and evaluate the missing edge ratio. Section 5 presents empirical results. Section 6 presents 
further discussion and closing remarks. 



Figure 1: Sorting and equivalence classes. The example pool (EMILY, DAVID, CHRIS, ALICE,
DAVID, BOBBY, DAVID, ALICE) is sorted into ALICE, ALICE, BOBBY, CHRIS, DAVID, DAVID, DAVID,
EMILY, and runs of identical strings form the equivalence classes.



2. Multiple Sorting Method 

In this section, we review the multiple sorting method (MSM) (Uno, 2008). 
2.1 Basic Idea 

Consider the problem of constructing the $d$-neighbor graph of string data of equal length
$\ell$. As the metric, we use the Hamming distance, i.e., the number of disagreements. The
problem is formulated as follows: Given a string pool $S = \{s_1, \ldots, s_n\}$, find all
pairs $i < j$ whose Hamming distance is at most $d$, $\mathrm{HamDist}(s_i, s_j) \le d$.
The multiple sorting method (MSM) (Uno, 2008) can solve the problem in $O(n + m)$ time,
where $m$ is the number of $d$-neighbor pairs,

    $m = |\{(i, j) \mid i < j, \ \mathrm{HamDist}(s_i, s_j) \le d\}|$.    (1)

To explain the idea of MSM intuitively, let us start from the special case $d = 0$, that
is, enumerating pairs of identical strings. In that case, the problem is solved by sorting
the strings and scanning the sorted list to divide it into equivalence classes (Figure 1).
Then, for each equivalence class, edges are built between all pairs. Using radix sort,
sorting takes only $O(n)$ time. The edge building takes $O(m)$ time. So the overall
complexity is $O(n + m)$.
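The $d = 0$ case can be sketched in a few lines of Python. This is only an illustration: the function name `identical_pairs` is hypothetical, and Python's built-in comparison sort stands in for radix sort, so the sorting step here costs $O(n \log n)$ rather than $O(n)$.

```python
from itertools import combinations, groupby


def identical_pairs(strings):
    """Enumerate index pairs (i, j), i < j, whose strings are identical."""
    # Sort indices by string so that identical strings become adjacent.
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    pairs = []
    # Each run of equal strings in the sorted order is one equivalence class.
    for _, run in groupby(order, key=lambda i: strings[i]):
        members = sorted(run)
        # Build an edge between every pair inside the class.
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


# The string pool of Figure 1 (0-based indices).
names = ["EMILY", "DAVID", "CHRIS", "ALICE", "DAVID", "BOBBY", "DAVID", "ALICE"]
print(identical_pairs(names))  # [(1, 4), (1, 6), (3, 7), (4, 6)]
```

The three copies of DAVID yield three edges and the two copies of ALICE yield one, which is the $O(m)$ edge-building step.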

Even if $d > 0$, we can enumerate neighbor pairs by applying radix sort multiple times.
Let $C$ denote a set of $d$ distinct integers taken from $\{1, \ldots, \ell\}$. Denote by
$s_i^C$ the $i$-th string whose characters at positions $C$ are removed. We call them
$C$-masked strings. Obviously, the following two statements are equivalent.

• There exists $C$ such that $s_i^C = s_j^C$, $|C| = d$.

• $\mathrm{HamDist}(s_i, s_j) \le d$.

Therefore, the neighbor pairs can be enumerated by trying every possible $C$ of size $d$
and sorting the masked strings. This requires $\binom{\ell}{d}$ sorting operations, hence
the time complexity is polynomial in $\ell$ and exponential in $d$. Nevertheless, in terms
of $n$ and $m$, the time complexity stays linear, yielding overall complexity $O(n + m)$.
Figures 2a and 2b illustrate multiple sorting with masks.
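The masking idea can be illustrated with the following minimal sketch. It is not the implementation of Uno (2008): the names `ham_dist` and `msm_pairs` are hypothetical, and a dictionary keyed by the masked string replaces the radix sort used to form equivalence classes.

```python
from itertools import combinations


def ham_dist(a, b):
    """Number of positions at which two equal-length strings disagree."""
    return sum(x != y for x, y in zip(a, b))


def msm_pairs(strings, d):
    """Enumerate pairs (i, j), i < j, with Hamming distance at most d."""
    length = len(strings[0])
    found = set()
    # Try every mask C of d positions.
    for mask in combinations(range(length), d):
        removed = set(mask)
        buckets = {}
        for idx, s in enumerate(strings):
            # C-masked string: characters at the masked positions are removed.
            key = "".join(ch for pos, ch in enumerate(s) if pos not in removed)
            buckets.setdefault(key, []).append(idx)
        # Strings that agree after masking differ in at most d positions.
        for members in buckets.values():
            found.update(combinations(members, 2))
    return sorted(found)


pool = ["0000", "0001", "0111", "1111", "1000"]
print(msm_pairs(pool, 1))  # [(0, 1), (0, 4), (2, 3)]
```

A pair may be detected under several masks, which is why the sketch collects results in a set.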
Figure 2: Multiple sorting method. (a) String pool containing two close pairs whose
distance is at most 2: (3,9) and (6,10). (b) Close pairs can be detected by masking 2
characters in every possible way and sorting the masked strings; the masks that detect
(3,9) and (6,10) are shown. (c) Blockwise masking.



2.2 Blockwise Masking 

Although the above method is optimal in terms of complexity, the practical computation
time is not always optimal because a large number of sorting operations are necessary. To
reduce the number of sorting operations, blockwise masking is useful.

Let us divide the strings into $k$ blocks of approximately equal length as in Figure 2c.
Define $B$ as a set of $d$ distinct integers taken from $\{1, \ldots, k\}$. Denote by
$s_i^B$ the $i$-th string whose blocks listed in $B$ are removed. If
$\mathrm{HamDist}(s_i, s_j) \le d$, then there exists $B$ such that $s_i^B = s_j^B$,
because at most $d$ mismatches can fall into at most $d$ blocks. However, the converse does
not hold. When pairs are enumerated by trying every possible $B$ and sorting the masked
strings as before, the solution set contains all neighbor pairs as well as a certain number
of non-neighbor pairs. To filter out non-neighbor pairs, we need to calculate the actual
Hamming distances. Since distance calculation is done only for pairs falling into an
equivalence class, the number of distance calculations is much smaller than in exhaustive
comparison. Indeed, the number of sorting operations is significantly reduced: from
$\binom{\ell}{d}$ to $\binom{k}{d}$. In the example shown in Figure 2c, the number of
sorting operations is reduced from 120 to 6. Due to space limitations, we cannot give full
details of the computational tricks. For further description, see the Appendix and the
original paper (Uno, 2008).
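Blockwise masking with the subsequent filtering step can be sketched as follows. The function name `blockwise_msm` and the example pool are assumptions made for illustration, and dictionary grouping again stands in for radix sort. The parameters $\ell = 16$, $k = 4$, $d = 2$ are those consistent with the counts 120 and 6 quoted for Figure 2c.

```python
from itertools import combinations
from math import comb


def ham_dist(a, b):
    return sum(x != y for x, y in zip(a, b))


def blockwise_msm(strings, d, k):
    """Pairs (i, j), i < j, with Hamming distance <= d, using k blocks (d <= k)."""
    length = len(strings[0])
    # Boundaries of k blocks of approximately equal length.
    bounds = [b * length // k for b in range(k + 1)]
    candidates = set()
    # Try every set B of d masked blocks.
    for masked in combinations(range(k), d):
        kept = [b for b in range(k) if b not in masked]
        buckets = {}
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in kept)
            buckets.setdefault(key, []).append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    # Candidates may include non-neighbors, so verify the actual distance.
    return sorted(p for p in candidates
                  if ham_dist(strings[p[0]], strings[p[1]]) <= d)


pool = [
    "0000000000000000",
    "0000000000000011",
    "1111000011110000",
    "1111000011110101",
    "1010101010101010",
    "0000000011111111",
    "0000000000001111",
]
print(blockwise_msm(pool, d=2, k=4))
print(comb(16, 2), comb(4, 2))  # 120 6
```

Strings 0 and 6 differ in four positions that all lie in the last block, so they become a candidate pair under any mask containing that block and are removed only by the distance check.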
2.3 Output Sensitive Complexity

The computational complexity of MSM depends on the number of solutions $m$. Such
complexity is called output sensitive (Johnson et al., 1988). Sometimes, a complexity that
depends on the input only is not a good description of reality. For example, the input-only
complexity of MSM is $O(n^2)$, which corresponds to the worst case in which all strings are
identical. One drawback of output sensitive complexity is that it cannot be used to predict
the computation time before running the algorithm. However, it is very useful for comparing
the efficiency of algorithms.



3. Locality Sensitive Hashing 

In this section, we review existing locality sensitive hashing (LSH) methods for nearest 
neighbor discovery. 



3.1 Basic Idea 

Locality sensitive hashing is a random mapping from a vector to a string of integers,
$\Re^M \to \mathbb{Z}^\ell$. The data points are mapped to strings and stored in a hash
table. When a query point is given, it is converted to a string, which is then used as a
key to retrieve the data points with the same key. Finally, the retrieved points whose
actual distances are within $\epsilon$ are reported. It is known that a single mapping
cannot control the number of false negatives (i.e., neighbors with non-identical keys)
(Datar et al., 2004). Therefore, it is common to create multiple hash tables and use them
simultaneously. Historically, the first LSH algorithm was proposed for the Hamming space
(Gionis et al., 1999). Later, it was extended to the Euclidean distance (Datar et al.,
2004). A similar mapping for the cosine distance was originally used in approximating the
max-cut problem (Goemans and Williamson, 1995). In the following, we focus on the cosine
LSH that is actually used in our experiments. Notice, however, that any kind of LSH can be
combined with MSM in principle.



3.2 Cosine LSH 

Denote $n$ data points in $\Re^M$ by $x_1, \ldots, x_n$. Let us take the cosine distance
as our metric,

    $\Delta(x_i, x_j) = 1 - \frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|}$.    (2)

We would like to enumerate all pairs whose cosine distance is at most $\epsilon$.

Let $R \in \Re^{M \times \ell}$ be a random matrix consisting of i.i.d. samples from the
standard normal distribution $N(0, 1)$. Let $s_1, \ldots, s_n$ be bit strings of length
$\ell$ consisting of '0' and '1'. The projection is defined as

    $s_{ik} := \mathrm{sign}(r_k^\top x_i)$,    (3)

where $s_{ik}$ is the $k$-th character of the $i$-th string, $r_k$ is the $k$-th column of
$R$, and $\mathrm{sign}(t)$ produces the character '1' if $t > 0$ and '0' otherwise. It is
known (Broder et al., 2000) that the following relationship holds:

    $\Pr(s_{ik} \ne s_{jk}) = \frac{\theta_{ij}}{\pi}, \quad \forall k$,    (4)

where $\theta_{ij}$ is the angle between $x_i$ and $x_j$:

    $\theta_{ij} = \arccos\left(\frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|}\right)$.

This relationship guarantees that the expected value of $\mathrm{HamDist}(s_i, s_j)$ is a
monotonically increasing function of the cosine distance. Furthermore, in the limit
$\ell \to \infty$, the Hamming distance between two bit strings, normalized by the string
length, converges to the angle between the original vectors divided by $\pi$:

    $\lim_{\ell \to \infty} \frac{1}{\ell} \mathrm{HamDist}(s_i, s_j) = \frac{\theta_{ij}}{\pi}$.

When $\ell$ is finite, $\mathrm{HamDist}(s_i, s_j)$ follows the binomial distribution
$\mathrm{Binom}(\ell, \theta_{ij}/\pi)$.
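The sign random projection and the collision relationship (4) can be checked numerically with a small sketch. It uses only the Python standard library; the function name `cosine_lsh`, the seed and the string length 4000 are illustrative assumptions, not values from the experiments.

```python
import math
import random


def cosine_lsh(vectors, length, rng):
    """Map each vector to a bit string via signs of random Gaussian projections."""
    dim = len(vectors[0])
    # One standard-normal direction per output character.
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
    strings = []
    for x in vectors:
        bits = ("1" if sum(r * a for r, a in zip(direction, x)) > 0 else "0"
                for direction in directions)
        strings.append("".join(bits))
    return strings


def ham_dist(a, b):
    return sum(x != y for x, y in zip(a, b))


theta = 0.25 * math.pi                       # angle between the two test vectors
x1 = [1.0, 0.0]
x2 = [math.cos(theta), math.sin(theta)]
length = 4000
s1, s2 = cosine_lsh([x1, x2], length, random.Random(0))
estimate = ham_dist(s1, s2) / length         # empirical non-collision rate
print(estimate, theta / math.pi)             # the estimate should be close to 0.25
```

With 4000 characters the standard deviation of the estimate is about 0.007, so the normalized Hamming distance lands close to $\theta_{ij}/\pi$.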

4. Multiple Sorting Method for Continuous Data 

This section presents our method for neighborhood graph construction. 

4.1 Basic Idea 

The basic idea is to map the data points to strings by LSH, and enumerate pairs of similar
strings by MSM. However, considering the fact that the computational complexity of MSM
is a polynomial function of the string length, it is not a good strategy to create long
strings and process them by MSM at once. Thus, we employ the following replication
strategy: for relatively small $\ell$, we create strings of length $\ell$ from the data
points by LSH. This is repeated $Q$ times, resulting in $Q$ independent string pools.
Denote by $s_i^q$ the $q$-th string corresponding to $x_i$. Then, MSM is applied to each
string pool. The $q$-th output set of MSM is described as

    $E_q = \{(i, j) \mid \mathrm{HamDist}(s_i^q, s_j^q) \le d, \ i < j\}$.    (5)

Finally, the output sets are merged into one,

    $E = E_1 \cup \cdots \cup E_Q$.

Then, we compute the actual distances $\Delta(x_i, x_j)$ for all pairs in $E$, and report
the pairs that satisfy $\Delta(x_i, x_j) \le \epsilon$.
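The whole procedure (hash, enumerate close strings per replicate, merge, verify) can be sketched end to end as below. This is a toy illustration under stated assumptions: all function names are hypothetical, character masking with dictionary grouping replaces the blockwise radix-sort MSM, and the parameter values are chosen only to keep the example small.

```python
import math
import random
from itertools import combinations


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def lsh_strings(vectors, length, rng):
    """One replicate: sign random projection of every vector to a bit string."""
    dim = len(vectors[0])
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
    return ["".join("1" if sum(r * a for r, a in zip(direction, x)) > 0 else "0"
                    for direction in directions) for x in vectors]


def close_string_pairs(strings, d):
    """Pairs of strings within Hamming distance d, found by masking d positions."""
    found = set()
    for mask in combinations(range(len(strings[0])), d):
        removed = set(mask)
        buckets = {}
        for idx, s in enumerate(strings):
            key = "".join(c for pos, c in enumerate(s) if pos not in removed)
            buckets.setdefault(key, []).append(idx)
        for members in buckets.values():
            found.update(combinations(members, 2))
    return found


def neighbor_graph(vectors, eps, length, d, replicates, seed=0):
    rng = random.Random(seed)
    candidates = set()
    for _ in range(replicates):              # E = E_1 ∪ ... ∪ E_Q
        candidates |= close_string_pairs(lsh_strings(vectors, length, rng), d)
    # Post-processing: keep only pairs whose actual distance is within eps.
    return sorted(p for p in candidates
                  if cosine_distance(vectors[p[0]], vectors[p[1]]) <= eps)


angle = 0.01 * math.pi
vectors = [[1.0, 0.0, 0.0], [math.cos(angle), math.sin(angle), 0.0],
           [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
eps = 1.0 - math.cos(0.02 * math.pi)
print(neighbor_graph(vectors, eps, length=16, d=2, replicates=5))
```

Because the final step recomputes $\Delta$, every reported pair is a true neighbor; only missing edges are possible.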

4.2 Missing Edge Ratio 

Given the true edge set $E^*$ and our intermediate solution $E$, there are two kinds of error.

• Type-I error (false positive): A non-neighbor pair has a Hamming distance within $d$
in at least one replicate.

    $F_1 = \{(i, j) \mid (i, j) \in E, \ (i, j) \notin E^*\}$.

• Type-II error (false negative): A neighbor pair has a Hamming distance larger than
$d$ in all replicates.

    $F_2 = \{(i, j) \mid (i, j) \notin E, \ (i, j) \in E^*\}$.

Figure 3: The bound of missing edge ratio against the number of replicates.

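The type-II errors are more critical in our method because the type-I errors are eventually
filtered out by calculating the actual distances in the post-processing step. However,
introducing too many type-I errors harms computational efficiency because distance
calculations are necessary.

The fraction of missing edges is defined as $|F_2|/|E^*|$, whose expectation (the missing
edge ratio) is bounded as follows:

    $E\left[\frac{|F_2|}{|E^*|}\right] \le \left(1 - \sum_{k=0}^{d} \binom{\ell}{k} p^k (1 - p)^{\ell - k}\right)^Q$,

where $p$ is an upper bound of the non-collision probability (4) for neighbors. The bound
follows because the Hamming distance of a neighbor pair in one replicate is binomially
distributed with success probability at most $p$, and the $Q$ replicates are independent.
For the cosine LSH, $p$ is set as follows,

    $p = \frac{\arccos(1 - \epsilon)}{\pi}$.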

There is no way to derive non-trivial type-I error bounds without the knowledge of data 
distribution. However, we found in empirical evaluations that the type-I error is typically 
small enough to keep our method competitive against other methods. 

Figure 3 depicts the bound on the missing edge ratio as a function of $Q$ for different
values of $d$. We used the cosine LSH where the radius is set such that $p = 0.1$, and
$\ell = 50$. It is observed that many replicates are required to achieve a small missing
edge ratio when $d = 0$. This plot illustrates the difficulty of performing nearest
neighbor search by hashing alone. As $d$ increases, the number of required replicates
decreases remarkably.
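The behaviour described for Figure 3 can be reproduced numerically with a short sketch. It assumes that the bound takes the binomial-tail form $(1 - \sum_{k \le d} \binom{\ell}{k} p^k (1-p)^{\ell-k})^Q$, which is what the binomial model of Section 3.2 and independent replicates imply; the function names are hypothetical.

```python
from math import comb


def miss_probability(length, d, p):
    """Probability that Binom(length, p) exceeds d, i.e. a neighbor pair is
    missed in a single replicate when its non-collision probability equals p."""
    within = sum(comb(length, k) * p ** k * (1 - p) ** (length - k)
                 for k in range(d + 1))
    return 1.0 - within


def missing_edge_bound(length, d, p, replicates):
    """A pair is missing only if it is missed in every independent replicate."""
    return miss_probability(length, d, p) ** replicates


# Setting quoted for Figure 3: p = 0.1 and string length 50.
for d in (0, 2, 4, 6):
    row = [missing_edge_bound(50, d, 0.1, q) for q in (1, 10, 100)]
    print(d, ["%.3g" % value for value in row])
```

For $d = 0$ a single replicate misses a pair with probability $1 - 0.9^{50}$, which is why hashing alone needs so many replicates.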

4.3 Setting Parameters 

Our method has three parameters $(d, \ell, Q)$. Users specify the radius $\epsilon$ and the
upper bound $\gamma$ of the missing edge ratio. We need to choose the parameters so as to
reduce the number of type-I and type-II errors as much as possible. Basically, we first
choose $\ell$, then $d$, and finally choose the minimum $Q$ such that the missing edge
ratio is smaller than $\gamma$. For efficient computation, we observe the following.

• Strings generated by random projection should be short, because the computation
time of MSM increases approximately linearly with the string length.

• The length $\ell$ should therefore be short, but a short $\ell$ increases the number of
type-I errors.

• The number of replicates $Q$ should be small.

• The number of mismatches $d$ should be small. The computation time of MSM
increases exponentially with $d$.

Of course these requirements conflict, so we need a way to choose good parameters. In
fact, $Q$ grows exponentially as $d$ decreases, so the computation time is long if $d$ is
too small or too large, and the best choice of $d$ lies in the middle. In our preliminary
experiments with fixed $\ell$ and varying $d$ and $Q$, when the number of replicates is
large, say larger than 25, increasing $d$ by one usually decreases the computation time.
Thus, in our parameter setting, we first choose $\ell$, then choose the smallest $d$ such
that $Q$ is less than 25.
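The rule "choose the smallest $d$ such that $Q$ is less than 25" can be written as a small search. The sketch assumes the binomial-tail form of the missing edge bound; the function names are hypothetical, and the value $\gamma = 10^{-6}$ used in the call is only an illustrative input, not a value taken from the text.

```python
import math
from math import comb


def miss_probability(length, d, p):
    """Probability that a neighbor pair exceeds Hamming distance d in one replicate."""
    return 1.0 - sum(comb(length, k) * p ** k * (1 - p) ** (length - k)
                     for k in range(d + 1))


def min_replicates(length, d, p, gamma):
    """Smallest Q with miss_probability ** Q <= gamma."""
    miss = miss_probability(length, d, p)
    if miss <= 0.0:
        return 1
    return max(1, math.ceil(math.log(gamma) / math.log(miss)))


def choose_d(length, p, gamma, max_replicates=25):
    """Smallest d whose required number of replicates is below max_replicates."""
    for d in range(length + 1):
        q = min_replicates(length, d, p, gamma)
        if q < max_replicates:
            return d, q
    return length, 1


d, q = choose_d(length=50, p=0.1, gamma=1e-6)
print(d, q)
```

The search always terminates because at $d = \ell$ every pair is within the threshold and a single replicate suffices.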

In the computation, MSM chooses $k - d$ blocks and sorts the strings according to these
blocks. Thus, each block should contain sufficiently many letters so that the cost of
pairwise comparison within the groups of strings sharing the same blocks stays small. For
this purpose, the $k - d$ blocks should have at least $\log_2 n$ letters, since the
expected number of vectors having the same blocks is then no greater than 1. Thus, we set
$\ell$ to $2\log_2 n$. This ensures the efficiency of MSM for small $d$. For large $d$ such
as 10, MSM does not perform well, but in such a case the performance is bad even if we take
a large $\ell$. This occurs when the threshold value is large, and there may be up to
$\Theta(n^2)$ close pairs. For such problems, straightforward pairwise comparison is the
best choice.

If the output pairs are few, type-I errors are usually also few. In such cases, especially
if $n$ is quite large, the bottleneck of the computation is always the generation of the
strings to be compared, because it involves a huge number of inner products. We should then
reduce $\ell$ and increase $d$ so that the computation times of MSM and random projection
are about the same. This requires an estimate of the number of output pairs. In our method,
we randomly sample vector pairs and roughly estimate the number of output pairs. Let $S$ be
the estimated number of output pairs. We limit the number of replicates to
$\max\{30S/n, 5\}$. The number 30 was chosen by preliminary experiments; usually, the
number of type-I errors is no greater than 30 times the number of close pairs. We also
slightly shorten $\ell$ according to $S$; when $S$ is close to 0, we decrease $\ell$ to
half.

For E2LSH, we have one more choice, the width parameter $w$. The probability that two
vectors have the same hash value is small if $w$ is small, and large if $w$ is large.
Thus, basically, a smaller $w$ makes $d$ larger, and a large $w$ involves many type-I
errors. We therefore also use the estimated output size $S$, and choose a large $w$ if $S$
is small. In our preliminary experiments with a quite small threshold distance, $w = 12$
was the best. With a large threshold, $w = 3.5$ was the best. Type-I errors seem to
increase mildly exponentially with $w$; thus, according to those experiments, we considered
that output pairs are many if $S > 10n$,
and we set w to 3.5 — — —■ We set w to 3.5 if w is smaller than 3.5, and set w to 12 if
w is larger than 12. 

5. Experiments 

Here we present the results of computational experiments to evaluate the practical
efficiency. All experiments were executed on a Linux PC with a 3.0 GHz Intel Core2Duo E8400
and 4 GB of memory.

5.1 Methods in Comparison 

Our method is compared with the cover tree (Beygelzimer et al., 2006), a state-of-the-art
exact nearest neighbor method. In the original paper, they perform $k$-nearest neighbor
search only, not $\epsilon$-neighbor search. However, the cover tree can be used for both
purposes in principle. We downloaded the template code,
http://hunch.net/~jl/projects/cover_tree/cover_tree_2.tar.gz, and used the function
epsilon_nearest_neighbor() developed for $\epsilon$-neighbor search. The cover tree is
based on the Euclidean distance, while our method is based on the cosine distance (2). For
fair comparison, vectors are centered and normalized to have norm one, under which the
squared Euclidean distance is equivalent to the doubled cosine distance. The radius for the
Euclidean distance is set to $\sqrt{2\epsilon}$, where $\epsilon$ is the radius for the
cosine distance.
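The equivalence used here follows from $\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2u^\top v = 2(1 - u^\top v)$ for unit vectors. A minimal numerical check, with hypothetical helper names and arbitrary test vectors, is:

```python
import math


def normalize(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean_radius(eps):
    """Euclidean radius matching a cosine-distance radius eps on unit vectors."""
    return math.sqrt(2.0 * eps)


u = normalize([3.0, 1.0, 2.0])
v = normalize([1.0, 4.0, 1.0])
print(squared_euclidean(u, v), 2.0 * cosine_distance(u, v))  # equal up to rounding
print(euclidean_radius(0.0489))
```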

In addition, we evaluated a variant of our method where the distance threshold $d$ is set
to zero. Since this method essentially uses locality sensitive hashing only, it is referred
to as "LSH" below. The parameter setting procedure presented in Section 4.3 is not
appropriate for LSH, because, at a small threshold $\gamma$, the number of required
replicates gets very large. So, we fixed the number of replicates $Q = 300$ a priori, and
adjusted the string length $\ell$ to achieve the required level of type-II error.

5.2 Efficiency Results 

The dataset we used is the set of small images collected by Torralba et al. (2008). The
dataset has 8 million images in total, but we used a smaller version containing 1.6 million
images, which was immediately downloadable from
http://people.csail.mit.edu/torralba/tinyimages/. Images are converted to gray scale,
yielding 1024 dimensional vectors. To observe the growth rate, we generated data of
different sizes by taking the first $n$ records. In neighborhood graph construction, we
tried three different values of the cosine distance radius, $\epsilon$ = 0.0123, 0.0489 and
0.109, which translate to $0.05\pi$, $0.10\pi$ and $0.15\pi$ in terms of angle,
respectively. It seems to be meaningless to try larger thresholds, because the number of
edges reached 0.2% of all pairs when $\epsilon = 0.109$ and $n = 320000$. It means that
each node has 640 neighbors on average. To make the number of missing edges negligibly
small, the missing edge ratio bound was set to $\gamma$ = 1.0 x 10^^ for MSM and LSH.
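The correspondence between the cosine distance radii and the angles, and the figure of 640 neighbors per node, can be verified with a few lines of arithmetic. The helper names are hypothetical; the relations used are $\epsilon = 1 - \cos\theta$ from (2) and an average degree of $2m/n$ with $m$ equal to 0.2% of the $n(n-1)/2$ pairs.

```python
import math


def eps_from_angle(theta):
    """Cosine distance radius corresponding to an angle threshold theta."""
    return 1.0 - math.cos(theta)


def average_degree(n, edge_fraction):
    """Average number of neighbors when a fraction of all pairs are edges."""
    edges = edge_fraction * n * (n - 1) / 2
    return 2 * edges / n


for multiple in (0.05, 0.10, 0.15):
    print(multiple, round(eps_from_angle(multiple * math.pi), 4))

print(round(average_degree(320000, 0.002)))  # 640
```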

Figure 4 shows the efficiency results. Our MSM method was consistently more efficient
than the cover tree. Already at small data sizes ($n$ around 40000), the difference is
clear. The time of the cover tree contains both tree construction and the searches over
the tree. One important thing to notice, however, is that MSM cannot process new query
points, while the cover tree can. The cover tree spends much time preparing a shallow
search tree to enable fast search for unknown query points. Our algorithm does not need to
construct such a tree, because our task is limited to the data points at hand.

Figure 4: Comparison of computation time.

Figure 5: Comparison of the number of neighbor pairs $m$.

Figure 5 shows the comparison of the number of neighbor pairs $m$. This shows that,
irrespective of the threshold, the number of neighbor pairs $m$ gets approximately three
times larger when the number of vectors $n$ is doubled. This implies that $m$ is a
super-linear function of $n$; this phenomenon is essentially different from $k$-nearest
neighbor graph construction, in which $m$ is linear in $n$.
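The growth rate can be made explicit by a short calculation, under the assumption (not stated in the text) that $m$ follows a power law $m \approx c\,n^{\alpha}$ over the observed range:

```latex
\frac{m(2n)}{m(n)} = \frac{c\,(2n)^{\alpha}}{c\,n^{\alpha}} = 2^{\alpha} \approx 3
\quad\Longrightarrow\quad
\alpha \approx \log_2 3 \approx 1.58 .
```

Under this reading, $m$ grows faster than linearly but more slowly than the $n^2$ of all pairs.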
Table 1: Computational time of MSM for 1.6 million images.

Angle threshold    #neighbor pairs    Time (sec)
$0.01\pi$          3382               1409
$0.02\pi$          36881              3277
$0.03\pi$          334462             4391
$0.04\pi$          1533622            7525



The computational results on 1.6 million images are separately summarized in Table 1.
At the angle threshold $0.04\pi$, MSM finishes in around 2.09 hours, creating around 1.5
million edges. This corresponds to around 1.9 neighbors per node.

5.3 Empirical Missing Edge Ratio 

Using the dataset with 160000 points, we plotted the empirical ratio of missing edges in 
Figure 6. This is not a theoretical bound, but actual counts. As expected from the bound, 
the missing ratio is reduced exponentially as the number of replicates increases. In the 
experiments of Figure 4, we set the threshold to 1.0 x 10~^, but it is quite easy to impose 
an even stricter threshold without large loss of efficiency. 

5.4 Output Sensitivity 

Output sensitive methods can finish the task quickly if the number of solutions is small. 
In our problem, MSM finishes quickly if $\epsilon$ is small. Not all algorithms have this
property. For example, the naive pairwise calculation spends the same amount of time
regardless of $\epsilon$. This property is guaranteed theoretically for MSM, but we would
like to see that MSM
has output sensitivity empirically as well. Figure 7 shows the computation time per 10,000 
output pairs. The curve is flat or slightly decreasing, showing that MSM is output sensitive 
in fact. 

6. Discussion and Concluding Remarks 

In this paper, we started by defining the problem of neighborhood graph construction and 
characterizing the difference from neighbor search. We have shown that our method, an 
extension of MSM for continuous domains, can be more efficient than the cover tree with a 
negligible ratio of missing edges. Here we combined MSM with random projections, but it is 
straightforward to combine MSM with other non-random hashing methods such as spectral 
hashing (Weiss et al., 2009) and semantic hashing (Salakhutdinov and Hinton, 2007). Such 
methods put emphasis on preserving important statistical structure rather than mapping 
similar vectors to similar strings. 

The idea of using sorting operations for finding neighbors, initiated by Uno (2008),
is quite stimulating. The topic of this paper was the combination of MSM and random
projections, but this is not the end of the story for us. In future work, we would like to
explore the following questions. 1) Is it really necessary to map the vectors to discrete
symbols? Apart from locality sensitive hashing, there are a variety of random projection
algorithms that map a vector to a real value (Li et al., 2006). Can they be used for better
efficiency? Sorting continuous values takes more time than radix sort, but it is possible
in $O(n \log n)$ time. 2) Hierarchical organization of the multiple sorting method. Our
current algorithm has a two-level structure in which $Q$ replicates are created, each of
which is divided into $k$ blocks. We are not yet sure if it is the optimal architecture.
Further efficiency could possibly be achieved by organizing the algorithm in a hierarchy of
more than two levels. 3) Implementation of MSM on a many-core processor. Though we
performed all sorting operations serially, MSM is inherently amenable to parallelization.
In the coming era of many-core processors and massive parallelism (Manferdelli et al.,
2008), MSM is highly promising as a standard method for neighborhood graph construction.
Fortunately, radix sort is available in most GPU libraries. It would be interesting to
implement MSM on a GPU and see how fast it can be.

Figure 6: Empirical ratio of missing edges. There were no missing edges when the number
of replicates $Q > 5$ and the angle threshold is $0.1\pi$ and $0.05\pi$.

Figure 7: Computation time per 10,000 output pairs.

Appendix: Acceleration Tricks for MSM 

In actual implementation of MSM, some computational tricks are effective in reducing the 
computational cost. In this section, we briefly describe some of them. 

The first trick is for checking duplications in $E_1, \ldots, E_Q$. In taking the union
$E = E_1 \cup \cdots \cup E_Q$, we do the following. Suppose that $Q$ strings
$s_i^1, \ldots, s_i^Q$ are generated for the $i$-th vector by random projections. Even if
we find a pair $s_i^h$ and $s_j^h$ having a Hamming distance at most $d$ for some $h$, we
do not immediately output it to $E$. In that case, we compare $s_i^{h'}$ and $s_j^{h'}$ for
all $1 \le h' < h$, and, only if none of the compared pairs has a distance at most $d$, the
pair $(s_i^h, s_j^h)$ is added to $E$. In this way, we can ensure that no duplication
happens.
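The duplicate check amounts to reporting a pair only in the earliest replicate that detects it. A minimal sketch, with a hypothetical function name and 0-based replicate indices, is:

```python
def ham_dist(a, b):
    return sum(x != y for x, y in zip(a, b))


def is_first_detection(h, strings_i, strings_j, d):
    """True if replicate h is the earliest one in which the pair is within d."""
    if ham_dist(strings_i[h], strings_j[h]) > d:
        return False
    # Compare the strings of all earlier replicates h' < h.
    return all(ham_dist(strings_i[g], strings_j[g]) > d for g in range(h))


# Strings of two vectors over three replicates; replicates 1 and 2 both match.
strings_i = ["0000", "0101", "1111"]
strings_j = ["1111", "0100", "1110"]
reported = [h for h in range(3) if is_first_detection(h, strings_i, strings_j, d=1)]
print(reported)  # [1]
```

With this rule the union can be formed without storing previously reported pairs, at the cost of at most $Q - 1$ extra string comparisons per detected pair.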

The second trick is about the speed-up of radix sort. The time complexity of radix sort
is $O((n + |\Sigma|)\ell)$, where $|\Sigma|$ is the size of the alphabet $\Sigma$ used in
the input strings. In our problem, the alphabet is $\{0, 1\}$, thus $|\Sigma| = 2$. As it
is very small compared to $n$, it is possible to save cost by unifying several letters into
one letter. For example, we unify every 20 letters into one letter so that the size of the
alphabet becomes $2^{20}$. This reduces the number of iterations in a radix sort from
$\ell$ to $\lceil \ell/20 \rceil$. Radix sort inserts each string into one of $|\Sigma|$
buckets, then scans all the buckets in increasing order of letters. When $|\Sigma|$ is
large, we can reduce the computation time by scanning only non-empty buckets. It means
that we do not need to sort the strings fully, but only to order them such that the same
strings appear consecutively. This reduces the time complexity from
$O((n + |\Sigma|)\ell)$ to $O(n\ell)$, except for initialization. Initialization is not
problematic, because it is performed only once, while radix sort is done many times.
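The two ideas (packing 20 bits into one letter, and visiting only non-empty buckets) can be sketched as below. This is an illustration rather than the original implementation: the function names are hypothetical, and a dictionary plays the role of the list of non-empty buckets, so the strings are grouped but not fully sorted.

```python
def pack_letters(bits, width=20):
    """Unify every `width` binary characters into one integer letter."""
    return [int(bits[p:p + width], 2) for p in range(0, len(bits), width)]


def group_identical(strings, width=20):
    """Partition indices so that identical strings end up in the same group."""
    packed = [pack_letters(s, width) for s in strings]
    groups = [list(range(len(strings)))]
    # One pass per packed letter: ceil(length / width) passes in total.
    for pos in range(len(packed[0])):
        refined = []
        for group in groups:
            buckets = {}                      # holds non-empty buckets only
            for idx in group:
                buckets.setdefault(packed[idx][pos], []).append(idx)
            refined.extend(buckets.values())
        groups = refined
    return groups


strings = ["0" * 50, "1" * 50, "0" * 50, "0" * 49 + "1"]
print(len(pack_letters(strings[0])))          # 3 letters instead of 50
print(sorted(group_identical(strings)))       # [[0, 2], [1], [3]]
```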